Performing regression analysis on personal data records

ABSTRACT

Disclosed herein are system, method, and computer program product embodiments for performing a regression analysis on lawfully collected personal data records. The analysis enables discovery of individuals likely to perform certain actions based on their personal data records and the personal data records and actions of others. The disclosed system, method, and computer program product may process vast quantities of data, including personal data records with thousands of categories and lawfully stored databases with millions of personal data records. Through the regression analysis, the disclosed system, method, and computer program product learn the most relevant categories for predicting an individual's actions based on input data provided by a user. The regression analysis is then applied to the categories of personal data records stored in a lawfully stored database to predict actions of individuals associated with those records, and the results are output to the user.

CROSS REFERENCE TO RELATED APPLICATIONS

This application is a continuation of U.S. patent application Ser. No. 15/073,007, filed Mar. 17, 2016, which is incorporated herein by reference in its entirety.

FIELD OF THE INVENTION

The present invention generally relates to the field of data mining. In particular, the present invention relates to a computer-based method and system operable to predict an action of an individual based on personal data records of a plurality of individuals.

BACKGROUND

Systems exist for lawfully collecting information describing characteristics or behavior of different people. Lawfully collecting such personal information has many applications, including in political and other fundraising, healthcare, marketing, and other fields. An action or transaction may generate data records specific to that action and the individual who performed it. For example, the major credit bureaus maintain and sell access to databases of personal financial data records for nearly every individual with a line of credit (e.g., a credit card, auto loan, mortgage, etc.) in the United States. As another example, databases describing mortgage information are also lawfully available.

Databases of personal data records may contain distinct records corresponding to the same individual. For example, an individual may have multiple mortgages over the course of a lifetime. Other types of lawfully available databases may maintain a single data record for an individual or social security number. Such records may be updated periodically or as events occur that affect an individual's data record.

A personal data record may include a number of categories. A data record representing an individual mortgage may include categories such as the name of the individual, his or her city, state, and ZIP code, the individual's employer, the name of the mortgage provider, the interest rate, and the amount of the loan. Data records from different sources may comprise different categories.

Such personal data may be used to predict whether an individual will engage in particular behavior. For example, the personal data may be used to predict whether an individual is likely to buy a product or participate in a marketing campaign.

Improved techniques for predicting behavior from personal information are needed.

BRIEF SUMMARY

The present disclosure provides a method and system operable to predict an action of an individual based on personal data records of a plurality of individuals. The disclosed method and system may utilize knowledge that a certain individual performed an action, as well as the personal data records of the individual who performed the action, to find individuals who are likely to perform the same or similar action.

In an embodiment, the present disclosure provides a method for predicting an action of an individual based on a plurality of personal data records. The method operates on a training set and a data set. The training set comprises a plurality of personal data training records, a plurality of categories associated with each personal data training record, and an action taken by an individual corresponding to the associated personal data training record. In an embodiment, the data set comprises a number of personal data records greater than the number of personal data training records in the training set. The method includes accessing the training set stored in memory and determining a subset of categories based on at least one personal data training record in the training set. The method then determines a prediction function that outputs an outcome score of a personal data record based on values of the subset of categories, and tests the accuracy of the prediction function based on at least one personal data training record in the training set. The method continues by accessing the data set and processing a subset of the personal data records in the data set based on the prediction function to determine an outcome score for each personal data record in the subset of personal data records.

System and computer program products are also disclosed.

Further embodiments, features, and advantages of the invention, as well as the structure and operation of the various embodiments, are described in detail below with reference to accompanying drawings.

BRIEF DESCRIPTION OF THE DRAWINGS

A further understanding of the present disclosure can be obtained by reference to the preferred embodiment and alternate embodiments set forth in the illustrations of the accompanying drawings. Although the illustrated embodiments are merely exemplary of systems for carrying out the present invention, both the organization and method of operation of the invention, in general, together with further objectives and advantages thereof, may be more easily understood by reference to the drawings and the following description. The drawings are not intended to limit the scope of this disclosure, which is set forth with particularity in the claims as appended or as subsequently amended, but merely to clarify and exemplify the invention. For a more complete understanding of the present disclosure, reference is now made to the following drawings in which:

FIG. 1 illustrates an example input and output of a modeling engine, according to an embodiment of the present invention.

FIG. 2 is a flowchart of an exemplary method for predicting actions based on personal data records.

FIG. 3 is a flowchart of an exemplary method for training a scoring engine.

FIG. 4 illustrates an embodiment of the functional components of a system for predicting actions based on personal data records.

FIG. 5 illustrates a first exemplary personal data record as it is processed by the system and used to predict an action for a second exemplary personal data record.

In the drawings, like reference numbers generally indicate identical or similar elements. Additionally, generally, the left-most digit(s) of a reference number identifies the drawing in which the reference number first appears.

DETAILED DESCRIPTION

Embodiments use regression analysis on personal data records to predict behavior of individuals corresponding to those records. As is set out below, a regression model is trained using a data set with data about individuals and their past behavior. The trained regression model is used to forecast whether other individuals will engage in the same behavior.

As required, a detailed illustrative embodiment of the present invention is disclosed herein. However, techniques, systems and operating structures in accordance with the present disclosure may be embodied in a wide variety of forms and modes, some of which may be quite different from those in the disclosed embodiment. Consequently, the specific structural and functional details disclosed herein are merely representative, yet in that regard, they are deemed to afford the best embodiment for purposes of disclosure and to provide a basis for the claims herein, which define the scope of the present invention. The following presents a detailed description of a preferred embodiment as well as alternate embodiments such as a simpler embodiment or more complex embodiments for alternate devices of the present invention.

FIG. 1 illustrates an example input and output of a modeling engine 100, according to an embodiment. Modeling engine 100 receives an input set 110 of personal data records 112 a-112 n and outputs an output set 120 of personal data records 122 a-122 n.

Each personal data record 112 in the input set 110 comprises a plurality of categories 111 a-111 n. In an embodiment, each personal data record 112 in the input set 110 comprises a last name 111 a, first name 111 b, age 111 c, address 111 d, and action 111 n. As a skilled artisan would understand, the input set 110 is not limited to the disclosed categories and may include a very large number of categories, both textual and numerical. Furthermore, categories 111 are not limited to the formatting illustrated in FIG. 1. For example, categories 111 a (“Last Name”) and 111 b (“First Name”) may instead be combined into a single category titled “Full Name.”

Each personal data record 112 in input set 110 includes an action 111 n. For each personal data record 112 in input set 110, the action category 111 n is a representation of an action performed, or not performed, by the individual corresponding to the personal data record 112. Action 111 n may describe, for example, whether an individual associated with a certain personal data record 112 subscribed to a specific newsletter or purchased a specific product. However, input set 110 need not provide any indication as to what underlying action the action category 111 n pertains to. In an embodiment, there may not be a single underlying action for action category 111 n. For example, for personal data record 112 a, the value in action category 111 n may correspond to whether Aaron Anderson has a mortgage, whereas for personal data record 112 b, the value in action category 111 n may correspond to whether Beth Brown subscribed to a mailing list for a product.

In an embodiment, the action 111 n comprises a binary value. For example, personal data record 112 a in FIG. 1 shows that Aaron Anderson of Des Moines, Iowa performed the action because the value in action category 111 n is ‘1’ in personal data record 112 a. Conversely, personal data records 112 b and 112 n show that Beth Brown of New York, N.Y. and MinJung Ma of Detroit, Mich. did not perform the action because the value in action category 111 n is ‘0’ in personal data records 112 b and 112 n. In other embodiments, the action 111 n is a real number. In another embodiment, the action 111 n comprises the number of occurrences of a given event in relation to an individual.

Two personal data records 112 in the input set 110 may comprise the same values for a category 111. However, modeling engine 100 assumes that each personal data record 112 in the input set 110 is associated with a distinct individual. Accordingly, in an embodiment, at least one category 111 contains a different value for any two personal data records 112 in input set 110. In another embodiment, the input set 110 may contain duplicate personal data records 112.

Each personal data record 122 in the output set 120 comprises a plurality of categories 121 a-121 n. The categories 121 a-121 n in the output set 120 are not limited to the categories 111 a-111 n in the input set 110. For example, categories 121 a-121 n of output set 120 may include political party 121 d, credit score 121 e, and email address 121 f, which may not be included in categories 111 a-111 n of input set 110. In an embodiment, each personal data record 122 in the output set 120 comprises a last name 121 a, first name 121 b, age 121 c, political party 121 d, credit score 121 e, email address 121 f, and outcome score 121 n. As a skilled artisan would understand, the output set 120 is not limited to the disclosed categories and may include a very large number of categories, both textual and numerical. Furthermore, categories 121 are not limited to the formatting illustrated in FIG. 1. For example, categories 121 a (“Last Name”) and 121 b (“First Name”) may instead be combined into a single category titled “Full Name.”

Each personal data record 122 in output set 120 includes an outcome score 121 n. In an embodiment, outcome score 121 n represents the probability that the individual associated with a personal data record 122 will perform the action(s) corresponding to action category 111 n of input set 110. In another embodiment, outcome score 121 n represents the probability that the individual associated with a personal data record 122 will perform an action similar to the action(s) corresponding to action category 111 n of input set 110. In another embodiment, outcome score 121 n is monotonically related to the probability that the individual associated with a personal data record 122 will perform the action(s) corresponding to action category 111 n of input set 110. When action 111 n represents the number of occurrences of an event, the outcome score 121 n may predict the number of occurrences of the event.

Personal data records 122 in output set 120 correspond to different individuals than the individuals associated with personal data records 112 in input set 110. Whether the individuals corresponding to output set 120 will perform an underlying action associated with action category 111 n is uncertain. Conversely, whether individuals corresponding to input set 110 performed the action is known. Data describing whether individuals corresponding to input set 110 performed the action is provided to the modeling engine 100 via action category 111 n. Accordingly, unlike the input set 110, the output set 120 does not include action category 111 n.

If action 111 n describes, for example, whether an individual associated with a certain personal data record 112 donated money to a specific political campaign, then outcome score 121 n may describe a probability that an individual associated with a certain personal data record 122 in output set 120 will donate money to that same political campaign. As with input set 110, however, output set 120 need not provide any indication as to what underlying uncertain action the outcome score 121 n pertains to. In an embodiment, the outcome score 121 n comprises a decimal value between 0.0 and 1.0. For example, personal data record 122 a in FIG. 1 shows that Lyla Hanna will perform the action(s) associated with action category 111 n with probability 0.8. Similarly, personal data records 122 b, 122 c, and 122 n show that Merita Sancha, Amrit Ukko, and Luisa Sechnaill will perform the action(s) associated with action category 111 n with probabilities 0.6, 0.55, and 0.55, respectively.

FIG. 2 is a flowchart of an exemplary method for predicting actions based on personal data records. In an embodiment, FIG. 1's modeling engine 100 processes the input set 110 using the steps shown in FIG. 2 to output the output set 120. The method begins at step 205 by receiving an input set of personal data records. In an embodiment, step 205 comprises the modeling engine 100 receiving input set 110 of personal data records 112.

The method continues at step 210 by cleaning the input set of personal data records. The step of cleaning transforms the input set into a consistent format useable for the remainder of the method. The input set, for example, may not have sufficient structure or labeling for data mining. Cleaning step 210 may parse the data within each personal data record and assign the parsed data to predetermined categories that match the format of personal data records in lawfully stored databases.

In step 215, the method matches personal data records in the input set with personal data records lawfully stored in a database. The term “matching” refers to determining that two or more personal data records correspond to the same individual. Since personal data records in a lawfully stored database may contain more categories than personal data records in the input set, matching a personal data record from the input set to a personal data record in a lawfully stored database may enable the use of more categories for training and testing a scoring engine within the modeling engine. Matching step 215 may comprise comparing the categories of the cleaned input set with personal data records stored in a lawfully stored database using a pair-wise function. Based on the comparison, matching step 215 may further comprise calculating a similarity score for each pair. In an embodiment, when the similarity score exceeds a predetermined threshold, matching step 215 may link and/or combine the personal data records.
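As an illustration only, the pair-wise comparison and thresholding of matching step 215 might be sketched as follows. The disclosure does not specify the pair-wise function, so this sketch assumes string similarity from Python's standard `difflib` module averaged over shared categories; the names `similarity` and `match_records`, the dictionary record layout, and the threshold of 0.9 are all hypothetical.

```python
from difflib import SequenceMatcher

def similarity(record_a, record_b, categories):
    """Pair-wise function (assumed): mean string similarity over shared categories."""
    ratios = [
        SequenceMatcher(None, str(record_a[c]).lower(), str(record_b[c]).lower()).ratio()
        for c in categories
    ]
    return sum(ratios) / len(ratios)

def match_records(input_set, database, categories, threshold=0.9):
    """Link each input record to its most similar database record.

    A pair is combined only when its similarity score exceeds the threshold.
    The combined record keeps the database categories plus the action value.
    """
    if not database:
        return []
    matched = []
    for rec in input_set:
        scored = [(similarity(rec, db_rec, categories), db_rec) for db_rec in database]
        score, best = max(scored, key=lambda pair: pair[0])
        if score > threshold:
            matched.append({**best, "action": rec["action"]})
    return matched
```

A production matcher would likely weight categories differently and avoid the full pair-wise scan over a large database; the sketch only shows the score-then-threshold structure.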

In step 220, the method forms a training set of personal data records from the matched data in the lawfully stored database. The personal data records in the training set may also be referred to as personal data training records. The training set is a set of personal data records from the lawfully stored database corresponding only to individuals represented by personal data records in the input set. In an embodiment, the training set is divided into two subsets. The first subset is used to train the scoring engine in step 225, and the second subset is used to test the scoring engine in step 230. In another embodiment, the training set of personal data records may be divided into a plurality of subsets that may alternate between being used to train the scoring engine in step 225 and to test the scoring engine in step 230.
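The two-subset division of step 220 can be sketched in a few lines. This is a minimal illustration, not the disclosed implementation: the function name `split_training_set`, the random shuffle, and the 25% test fraction are assumptions, since the disclosure does not say how records are assigned to each subset.

```python
import random

def split_training_set(training_set, test_fraction=0.25, seed=0):
    """Divide the training set into a training subset and a test subset."""
    records = list(training_set)
    # Shuffle with a fixed seed so the split is reproducible (an assumption).
    random.Random(seed).shuffle(records)
    n_test = max(1, int(len(records) * test_fraction))
    # First subset trains the scoring engine; second subset tests it.
    return records[n_test:], records[:n_test]
```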

In step 225, the method trains the scoring engine using a subset of the training set of personal data records designated for training. During training, a model for the scoring engine is assumed. In an embodiment, the scoring engine is assumed to take the form of the following function:

$P = \frac{e^{\theta^{T}x}}{1 + e^{\theta^{T_{x}}}}$

where P is the outcome score, e is Euler's number (approximately 2.71828), θ is a column vector of parameters, and x is a column vector of values corresponding to categories of a personal data record. In the above equation, the letter “T” represents the vector transpose operation. The vectors $\theta = [\theta_1, \theta_2, \ldots, \theta_N]^{T}$ and $x = [x_1, x_2, \ldots, x_N]^{T}$ are both of size N×1, where N is the number of categories, excluding the action category, in the personal data records that form the training set. In other embodiments, insubstantial changes may be made to the above prediction function. The insubstantial changes may include adding small offsets, coefficients, and exponents.
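The prediction function above is the logistic function applied to the inner product of θ and x. A minimal sketch follows, assuming both vectors are plain Python lists of numbers; the function name `outcome_score` is hypothetical, and the branch on the sign of the inner product is only a numerical safeguard against overflow, not part of the disclosure.

```python
import math

def outcome_score(theta, x):
    """Compute P = e^(theta^T x) / (1 + e^(theta^T x))."""
    z = sum(t * v for t, v in zip(theta, x))  # inner product theta^T x
    if z >= 0:
        # Algebraically identical form that avoids overflow for large z.
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)
```

With all parameters equal to zero the score is 0.5 for every record, which is the natural starting point before any category has been found informative.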

In the above embodiment, a goal of training the scoring engine is to find θ such that the outcome score P accurately predicts whether an individual will perform the action(s) described by the action category of the training set based on the individual's personal data record(s). For the individuals represented in the training set, whether or not the individual performed the action(s) is known. Provided with a large enough training set, therefore, the goal of predicting outcome scores for individuals not represented in the training set may be approximated by finding θ that minimizes a difference between P and the action category for the training set, given a set of constraints on the structure of θ.

In some embodiments, N>1000. In other words, the number of categories about an individual may be over a thousand. In cases where a large number of categories exist, computation of the term $\theta^{T}x$ may be computationally intractable over a large database, which may contain hundreds of millions or billions of distinct personal data records (i.e., hundreds of millions of different x's). It may therefore be advantageous to impose structure on θ such that $\theta_i = 0$ for most i. When $\theta_i = 0$, the ith category plays no role in the scoring engine and can therefore be ignored. In effect, the size of vectors θ and x can be reduced from N×1 to $\hat{N} \times 1$, where $\hat{N} \ll N$. Specific methods for minimizing $\hat{N}$ while maintaining an accurate scoring engine are described in further detail below relative to FIG. 3.

After the scoring engine has been trained, step 230 tests the scoring engine using personal data records from the training set that were not used in training step 225. The personal data records employed in step 230 may also be known as a test set. Testing compares the outcome score predicted for a personal data record in the test set with the action category for that personal data record and assigns a predictor score to the scoring engine.

In an embodiment, the predictor score is the mean squared error of the outcome scores relative to the action categories for each personal data record in the test set. Mathematically, such a predictor score would take the form

$S = {\frac{1}{R}{\sum\limits_{r = 1}^{R}\left( {P_{r} - A_{r}} \right)^{2}}}$

where S is the predictor score, $P_r$ is the outcome score for personal data record r, $A_r$ is the value in the action category of personal data record r, and R is the number of personal data records in the test set. In the above embodiment, $P_r$ may be a decimal value ranging between 0 and 1, and $A_r$ may be a binary number with a value either 0 or 1.

In another embodiment, the predictor score is the percentage of correct predictions when the outcome score is rounded to its nearest integer value. In this case, the predictor score would take the form

$S = {\frac{1}{R}{\sum\limits_{r = 1}^{R}\left( {1 - \left( {\left\lfloor {P_{r} + 0.5} \right\rfloor - A_{r}} \right)^{2}} \right)}}$

where $\lfloor P_r + 0.5 \rfloor$ rounds $P_r$ to the nearest integer. As in the mean-square error case, $P_r$ may be a decimal value ranging between 0 and 1, and $A_r$ may be a binary number with a value either 0 or 1.
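Both predictor scores can be computed directly from a list of outcome scores and the matching list of binary action values. The sketch below is illustrative; the function names `mse_score` and `accuracy_score` are hypothetical, and the sample values in the usage note are made up for demonstration.

```python
import math

def mse_score(outcomes, actions):
    """Mean squared error between outcome scores P_r and action values A_r."""
    return sum((p - a) ** 2 for p, a in zip(outcomes, actions)) / len(actions)

def accuracy_score(outcomes, actions):
    """Fraction of correct predictions after rounding with floor(P_r + 0.5).

    For binary A_r, the term 1 - (floor(P_r + 0.5) - A_r)^2 is 1 for a
    correct prediction and 0 for an incorrect one.
    """
    return sum(
        1 - (math.floor(p + 0.5) - a) ** 2 for p, a in zip(outcomes, actions)
    ) / len(actions)
```

For outcome scores 0.8, 0.6, 0.55, 0.2 and action values 1, 0, 1, 0, only the second record is mispredicted, so `accuracy_score` gives 0.75. Note that a lower mean squared error indicates a better scoring engine, whereas a higher fraction of correct predictions does.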

The steps of training 225 and testing 230 may be performed a number of times using different subsets of data from the training set to train 225 and test 230 the scoring engine. For example, the scoring engine may be trained in four iterations using four different subsets of the training set. Personal data records in the training set not used to train the scoring engine may be used to test the scoring engine such that, in the present example, a specific personal data record is used to train the scoring engine in one iteration and is used to test the scoring engine in the other three iterations. The predictor scores for each iteration may be compared, and the trained scoring engine with the best predictor score (for example, the lowest mean squared error or the highest percentage of correct predictions) may then be used to predict outcomes in subsequent steps.
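The iteration over subsets described above might look like the following sketch. It follows the example in the text, where each iteration trains on one subset and tests on the remaining records. The names `partition` and `select_engine`, the interleaved partitioning scheme, and the `train` and `evaluate` callables are hypothetical placeholders for the training and testing steps.

```python
def partition(records, k):
    """Split records into k disjoint subsets of roughly equal size."""
    return [records[i::k] for i in range(k)]

def select_engine(records, k, train, evaluate, higher_is_better=True):
    """Train on each subset in turn, test on the rest, and keep the best engine.

    train(subset) returns a trained scoring engine; evaluate(engine, subset)
    returns its predictor score on the held-out records.
    """
    subsets = partition(records, k)
    best_engine, best_score = None, None
    for i, train_subset in enumerate(subsets):
        # Records not used for training in this iteration form the test set.
        test_subset = [r for j, s in enumerate(subsets) if j != i for r in s]
        engine = train(train_subset)
        score = evaluate(engine, test_subset)
        if best_score is None:
            improved = True
        elif higher_is_better:
            improved = score > best_score
        else:
            improved = score < best_score
        if improved:
            best_engine, best_score = engine, score
    return best_engine, best_score
```

Set `higher_is_better=False` when the predictor score is an error measure such as mean squared error.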

In step 235, the scoring engine is used to predict an outcome score for personal data records in one or more lawfully stored databases. In an embodiment, the one or more lawfully stored databases are the same as those used in step 215 to match the input set to databases of personal data records. The personal data records in the lawfully stored databases may therefore have the same categories as the personal data records in the training set. The outcome score $P_r$ for a personal data record r may be calculated as $P_r = e^{\theta^{T}x_r}/(1 + e^{\theta^{T}x_r})$. For computational efficiency, the scoring engine may disregard categories where $|\theta_i| < \epsilon$ and form a reduced parameter vector $\hat{\theta}$ and a reduced personal data record $\hat{x}_r$ of size $\hat{N} \times 1$, where $|\hat{\theta}_i| > \epsilon$ for all $\hat{\theta}_i \in \hat{\theta}$. The outcome score may correspond to the probability that an individual associated with a personal data record will perform the action(s) represented by the action category of the training set. In an embodiment, the scoring engine determines an outcome score for a subset of the personal data records in the lawfully stored database(s).
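A sketch of the reduced scoring computation follows. It assumes that a category is disregarded when the magnitude of its parameter falls below ε, and it stores the indices of the retained categories so that each record only contributes those values. The names `reduce_parameters` and `score_reduced` and the default ε are hypothetical.

```python
import math

def reduce_parameters(theta, eps=1e-6):
    """Return indices and values of parameters whose magnitude exceeds eps."""
    kept = [i for i, t in enumerate(theta) if abs(t) > eps]
    return kept, [theta[i] for i in kept]

def score_reduced(kept, theta_hat, x):
    """Outcome score using only the retained categories of record x."""
    z = sum(t * x[i] for i, t in zip(kept, theta_hat))
    # Overflow-safe evaluation of e^z / (1 + e^z).
    e = math.exp(-abs(z))
    return 1.0 / (1.0 + e) if z >= 0 else e / (1.0 + e)
```

Because the reduction is computed once and reused for every record, the per-record cost depends on the number of retained categories rather than on N.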

In step 240, the subset of personal data records processed by the scoring engine is output. In an embodiment, only the personal data records comprising the X highest outcome scores are output. In another embodiment, all personal data records with outcome scores greater than an outcome threshold P₀ are output. In some embodiments, a subset of the categories for each personal data record are output, and the output categories may not correspond to the categories used by the scoring engine. For example, telephone numbers are unlikely to be useful to the scoring engine in forming predictions about whether an individual will participate in a marketing campaign, but would be useful to output since the telephone number could be used to contact the individual. In other embodiments, all of the categories are output.
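The output options of step 240 reduce to sorting, filtering, and projecting records. The following sketch assumes each scored record is a dictionary with an `outcome_score` key; the function names and the record layout are hypothetical.

```python
def top_x(scored_records, x):
    """Return the records with the X highest outcome scores."""
    ranked = sorted(scored_records, key=lambda r: r["outcome_score"], reverse=True)
    return ranked[:x]

def above_threshold(scored_records, p0):
    """Return all records whose outcome score is greater than threshold P0."""
    return [r for r in scored_records if r["outcome_score"] > p0]

def project(records, output_categories):
    """Keep only the requested output categories of each record."""
    return [{c: r[c] for c in output_categories} for r in records]
```

The projection step is what allows a category such as a telephone number to be output even though the scoring engine never used it.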

In step 245, the method receives action values for previously output personal data records. For example, in step 240 the method may have output a personal data record corresponding to “Person X” with an outcome score of 0.7. In response to receiving this personal data record, a user may contact “Person X” and, in effect, test the outcome score of the scoring engine. The result of this test is an action value that can be delivered to the disclosed system to further refine the scoring engine. In step 250, for example, the output personal data record along with the newly discovered action value may be moved into the training set. This updated training set may then be used to re-train the scoring engine in step 225 for improved accuracy. A skilled artisan would understand that the initial scoring engine trained from the initial training set may be sufficient, and therefore steps 245 and 250 may be optional to the disclosed method.

FIG. 3 is a flowchart of an exemplary method for training a scoring engine. In an embodiment, this method is used to find a parameter vector θ for the scoring engine modeled by the function $P = e^{\theta^{T}x}/(1 + e^{\theta^{T}x})$. Training may require a plurality of iterations. In some embodiments, parameter α₁, described below, is decreased at every iteration. In some embodiments, training continues until the difference between successive predictor scores $S_i$ and $S_{i-1}$ is less than a predetermined threshold γ.

The method begins in step 305 by setting iterator variable k equal to zero. Next, in step 310, an initial α₁ is determined. In some embodiments, the initial α₁ is large. As described below relative to step 315, the parameter α₁ controls how many elements of parameter vector θ will be non-zero. A large α₁ may result in very few, if any, non-zero elements of parameter vector θ. Thus, parameter α₁ controls how many categories of a personal data record are used for predicting the outcome score for an individual. It does not, however, dictate which categories are to be used for this prediction.

In step 312, parameter selection is performed at each iteration. Parameter selection reduces the computational complexity of step 315 by setting a subset of the values in θ to zero prior to solving for θ in step 315. The method does not solve for these values in step 315. In an embodiment, $\theta_j$ is set to 0 at iteration k whenever $|x_j^{T}(A - P(\theta^{(k-1)}))| < \gamma$, where A is the vector of action values and $P(\theta^{(k-1)})$ is the corresponding vector of outcome scores computed with the parameter vector from the previous iteration. The threshold γ may be a function of α₁ and α₂ at previous iterations. In an embodiment, $\gamma = \alpha_2(2\alpha_1^{(k)} - \alpha_1^{(k-1)})$.
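The screening rule of step 312 can be illustrated as below. This sketch rests on an interpretation that the disclosure does not state explicitly: $x_j$ is taken to be the column of values of category j across all training records, so that $x_j^{T}(A - P)$ is the correlation of that category with the current residuals. The function names and the row-major list-of-lists layout of `X` are hypothetical.

```python
import math

def _sigmoid(z):
    """Overflow-safe logistic function."""
    e = math.exp(-abs(z))
    return 1.0 / (1.0 + e) if z >= 0 else e / (1.0 + e)

def screen_parameters(X, A, theta_prev, alpha1_k, alpha1_prev, alpha2):
    """Return indices j that remain active, i.e. are NOT forced to zero.

    X is a list of records (rows); A is the list of action values;
    theta_prev is the parameter vector from iteration k-1.
    """
    gamma = alpha2 * (2 * alpha1_k - alpha1_prev)
    # Residual vector A - P(theta^(k-1)).
    residual = [
        a - _sigmoid(sum(t * v for t, v in zip(theta_prev, row)))
        for row, a in zip(X, A)
    ]
    active = []
    for j in range(len(theta_prev)):
        corr = sum(row[j] * r for row, r in zip(X, residual))  # x_j^T (A - P)
        if abs(corr) >= gamma:
            active.append(j)
    return active
```

Only the active indices would then be passed to the solver in step 315; all other parameters stay at zero for that iteration.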

In step 315, the method solves for the parameter vector θ that minimizes a cost function Y(θ, α₁). In an embodiment,

${{Y\left( {\theta,\alpha_{1}} \right)} = {{\frac{1}{M}{\sum_{m \in M}\left( {A_{m} - {P_{m}(\theta)}} \right)^{2}}} + {\alpha_{1}{\sum_{i = 1}^{N}{\theta_{i}}}} + {\alpha_{2}{\sum_{i = 1}^{N}{\theta_{i}}^{2}}}}},$

where M is the set of personal data records in the subset of the training set used for training, $A_m$ is the value in the action category for personal data record m∈M, $P_m(\theta) = e^{\theta^{T}x_m}/(1 + e^{\theta^{T}x_m})$ is the outcome score for personal data record m, and α₂ is a constant coefficient.

In another embodiment, $Y(\theta, \alpha_1) = -\sum_{m \in M}\left[A_m \log P_m(\theta) + (1 - A_m)\log(1 - P_m(\theta))\right] + \alpha_1\left(\alpha_2\|\theta\|_1 + \frac{1}{2}(1 - \alpha_2)\|\theta\|_2^{2}\right)$, that is, the negative log-likelihood of the training data plus a weighted penalty on the size of θ. In this embodiment, for a given α₁, minimization of Y(θ, α₁) is known as Elastic Net regularization and may be performed using conventional methods as would be understood by a person of skill in the art. The optimal parameter vector at iteration i is denoted as $\theta_i^{*}$. In other embodiments, insubstantial changes may be made to the above cost functions. The insubstantial changes may include adding small offsets, coefficients, and exponents.
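To make the two cost functions concrete, the sketch below evaluates each for a given θ. It assumes the conventional Elastic Net form for the second embodiment, namely the negative log-likelihood plus the penalty term. The function names, the list-based data layout, and the clamping constant that keeps the logarithm finite are hypothetical.

```python
import math

def _outcome(theta, x):
    """Logistic outcome score P_m(theta), evaluated without overflow."""
    z = sum(t * v for t, v in zip(theta, x))
    e = math.exp(-abs(z))
    return 1.0 / (1.0 + e) if z >= 0 else e / (1.0 + e)

def cost_squared_error(theta, X, A, alpha1, alpha2):
    """First embodiment: mean squared error + alpha1 * L1 + alpha2 * squared L2."""
    fit = sum((a - _outcome(theta, x)) ** 2 for x, a in zip(X, A)) / len(X)
    l1 = sum(abs(t) for t in theta)
    l2_sq = sum(t * t for t in theta)
    return fit + alpha1 * l1 + alpha2 * l2_sq

def cost_elastic_net(theta, X, A, alpha1, alpha2, tiny=1e-12):
    """Second embodiment: negative log-likelihood + Elastic Net penalty."""
    nll = 0.0
    for x, a in zip(X, A):
        # Clamp P away from 0 and 1 so the logarithms stay finite.
        p = min(max(_outcome(theta, x), tiny), 1.0 - tiny)
        nll -= a * math.log(p) + (1 - a) * math.log(1.0 - p)
    l1 = sum(abs(t) for t in theta)
    l2_sq = sum(t * t for t in theta)
    return nll + alpha1 * (alpha2 * l1 + 0.5 * (1.0 - alpha2) * l2_sq)
```

In the second form α₁ sets the overall penalty strength and α₂ sets the mix between the L1 and L2 terms, whereas in the first form the two coefficients weight the two terms independently.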

As can be seen from the above equation, the coefficient α₁ serves as a weight penalizing a large L1-norm for the vector θ (the L1-norm is $\|\theta\|_1 = \sum_{i=1}^{N}|\theta_i|$). Thus, the minimization will force elements of parameter vector θ to zero while maintaining a large log-likelihood (or small mean squared error) in the outcome score. Choosing a large α₁ will, accordingly, result in a large penalty for a solution with many non-zero elements of parameter vector θ, and thus most categories will not be considered for the scoring engine. Conversely, choosing α₁ too small results in almost no penalty for a solution with many non-zero elements of parameter vector θ, and thus most categories will be considered for the scoring engine. In other words, the size of the subset of categories considered by the scoring engine is inversely related to the magnitude of the coefficient α₁.

In step 320, the method tests the accuracy of the parameter vector computed in step 315 using the subset of the training set known as the test set, as previously described relative to FIG. 2. In an embodiment, step 320 in FIG. 3 corresponds to step 230 in FIG. 2. The method then continues to step 325 by determining if i>0, that is, if the current iteration is not the first iteration of the training process, where i denotes the iterator variable set in step 305 (there written k). If i=0, then the method decreases α₁ in step 330, increments i in step 335, and finds a new $\theta_i^{*}$ in step 315. If i>0, then the method determines in step 340 whether the accuracies of successive solutions are within a threshold difference γ. If so, the training is said to have converged since decreasing α₁ (i.e., increasing the number of categories used by the scoring engine) in the previous iteration did not result in significantly improved accuracy. In that case, the training process ends in step 345 by assigning the optimum solution $\theta^{*} = \theta_i^{*}$. Otherwise, the method again decreases α₁ in step 330, increments i in step 335, and finds a new $\theta_i^{*}$ in step 315.
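The loop of FIG. 3 can be sketched end to end as follows. Several pieces are assumptions made only so the sketch runs: the solver for step 315 is a plain proximal-gradient iteration (the disclosure says only that conventional methods may be used), the log-likelihood is averaged over the records rather than summed, α₁ is decreased by a constant factor, and the predictor score is the fraction of correct predictions. All function and parameter names are hypothetical.

```python
import math

def predict(theta, x):
    """Logistic outcome score, evaluated without overflow."""
    z = sum(t * v for t, v in zip(theta, x))
    e = math.exp(-abs(z))
    return 1.0 / (1.0 + e) if z >= 0 else e / (1.0 + e)

def soft_threshold(v, t):
    """Shrink v toward zero by t; this is what produces exact zeros."""
    return math.copysign(max(abs(v) - t, 0.0), v)

def solve(X, A, alpha1, alpha2, theta0, step=0.1, iters=500):
    """Step 315 (assumed solver): proximal gradient on the Elastic Net cost."""
    theta = list(theta0)
    n = len(X)
    for _ in range(iters):
        resid = [predict(theta, x) - a for x, a in zip(X, A)]
        for j in range(len(theta)):
            grad = sum(r * x[j] for r, x in zip(resid, X)) / n
            grad += alpha1 * (1.0 - alpha2) * theta[j]  # smooth L2 part
            theta[j] = soft_threshold(theta[j] - step * grad, step * alpha1 * alpha2)
    return theta

def accuracy(theta, X, A):
    """Step 320: fraction of correct predictions after rounding."""
    hits = sum(1 for x, a in zip(X, A) if math.floor(predict(theta, x) + 0.5) == a)
    return hits / len(X)

def train(X_train, A_train, X_test, A_test, alpha1=1.0, alpha2=0.9,
          shrink=0.5, gamma=0.01, max_rounds=20):
    theta = [0.0] * len(X_train[0])
    prev_score = None
    for _ in range(max_rounds):
        theta = solve(X_train, A_train, alpha1, alpha2, theta)  # step 315
        score = accuracy(theta, X_test, A_test)                 # step 320
        # Steps 325 and 340: after the first round, stop once accuracy stalls.
        if prev_score is not None and abs(score - prev_score) < gamma:
            break                                               # step 345
        prev_score = score
        alpha1 *= shrink                                        # step 330
    return theta, alpha1
```

Starting each solve from the previous θ (a warm start) is a design choice of the sketch; it keeps successive solutions close as α₁ is decreased.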

FIG. 4 illustrates an embodiment of the functional components of a modeling engine 450 for predicting actions based on personal data records. A user 400 interacts with the modeling engine 450 by providing input data to the data cleaner 405 and receiving an output set of personal data records with outcome scores from scoring engine 425. The input data may contain an action value for each entry as shown in FIG. 1. The data cleaner 405 may clean the data as described above in relation to step 210 in FIG. 2.

The data cleaner 405 passes the cleaned data to the data matcher 410. The data matcher 410 may match the records contained in the cleaned data to personal data records in the lawfully stored databases of personal data records 430 based on determining that the same individual corresponds to matching records. Data matcher 410 may match records as described above in relation to step 215 in FIG. 2.

The data matcher 410 passes the matched data to the trainer 415. Within the trainer 415, the matched data is known as the training set. As previously described, the trainer 415 may partition the training set into subsets usable for either training or testing. The trainer 415 may train the scoring engine as described above in relation to FIGS. 2 and 3.

The trainer 415 then passes the training data and the trained scoring engine to the tester 420. The tester 420 may test the accuracy of the trained scoring engine as described above in relation to FIGS. 2 and 3. The trainer 415 and tester 420 may iterate to find the best scoring engine by varying which subsets of the training set are usable for training or testing. In an embodiment, the trainer 415 and tester 420 may also iterate to find the best scoring engine by varying the Elastic Net parameter that effectively controls how many categories are considered by the scoring engine as described above in relation to FIG. 3.

The tester 420 passes the final parameter vector θ* to the scoring engine 425, which applies the prediction function to the lawfully stored databases of personal data records 430. The scoring engine 425 determines an outcome score for a subset of the personal data records in lawfully stored databases 430. In an embodiment, the subset of the personal data records in lawfully stored databases 430 is a strict subset. In other embodiments, the subset is the entire database.

The scoring engine 425 outputs one or more personal data records from the subset to the user. In an embodiment, the output comprises a strict subset of the categories of the personal data records in lawfully stored databases 430. In some embodiments, the output categories may not coincide with the categories considered by the scoring engine to form an output score. In further embodiments, one or more categories may be an output category and also be considered by the scoring engine 425 to form an output score. In an embodiment, the output score is one of the output categories. In some embodiments, the user specifies how many personal data records to output. In some embodiments, the user specifies that only personal data records with an outcome score above a threshold shall be output. Furthermore, in some embodiments, the user may request output data from the modeling engine 450 without providing input data. Such embodiments include a modeling engine 450 with a scoring engine 425 that has previously been trained.
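
The output options described above may be illustrated by the following Python sketch, which filters scored records by a user-specified threshold, limits the number of records returned, and restricts each record to a set of output categories. The function name select_output, the key outcome_score, and the choice to return the highest-scoring records first when a count is specified are assumptions of the sketch rather than features of the disclosure.

```python
def select_output(scored_records, output_categories, max_records=None, min_score=None):
    """Return scored records restricted to the requested output categories.

    scored_records: list of dicts, each carrying an 'outcome_score' value.
    min_score: keep only records whose outcome score is above this threshold.
    max_records: keep at most this many records, highest scores first (assumed ordering).
    """
    kept = [r for r in scored_records
            if min_score is None or r["outcome_score"] > min_score]
    kept.sort(key=lambda r: r["outcome_score"], reverse=True)
    if max_records is not None:
        kept = kept[:max_records]
    # The outcome score is output only if it is listed among the output categories.
    return [{c: r[c] for c in output_categories if c in r} for r in kept]


# Illustrative data only.
scored = [
    {"name": "A", "zip": "20001", "income_band": "C", "outcome_score": 0.91},
    {"name": "B", "zip": "73301", "income_band": "A", "outcome_score": 0.40},
    {"name": "C", "zip": "10001", "income_band": "B", "outcome_score": 0.77},
]
result = select_output(scored, ["name", "outcome_score"], max_records=2, min_score=0.5)
```

Here the output categories are a strict subset of the stored categories, and the outcome score is itself one of the output categories.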

FIG. 5 illustrates a flow diagram of an exemplary method 500 for predicting behaviors of individuals based on data produced/modified using various embodiments of the foregoing methods and systems.

As shown, FIG. 5 depicts the contents of an exemplary data record 578 describing an individual. In an embodiment, data record 578 may be consumer data, such as an individual's purchasing history; web-browsing data, such as an individual's browsing and/or web-purchase history, lawfully tracked using first- or third-party cookies; an individual's mortgage history; or any other personal data that may be lawfully tracked or purchased through commonly used methods.

At step 580, a processor (such as processor 104 or computing device(s) 126 of FIG. 1) accesses, parses, and categorizes data record 578 in accordance with the foregoing embodiments, resulting in categorized data record 582.

At step 584, the processor compares categorized data record 582 against additional data records in order to determine whether categorized data record 582 should be linked, grouped, and modified to mirror the identity described by a separate data record, in accordance with the foregoing embodiments. Resulting from step 584 is training data 586.

At step 588, training data 586 is entered into a training system in order to compare and find individuals possessing similar interests, preferences, and other demographic data. Step 588 further includes predicting future behaviors of similar individuals, based on an order of similarity between the individuals. For example, after comparing the training data 586 against additional data records, an outcome score may be calculated. Step 588 returns output data 590, having outcome score 592.

It is to be appreciated that the Detailed Description section, and not the Summary and Abstract sections (if any), is intended to be used to interpret the claims. The Summary and Abstract sections (if any) may set forth one or more but not all exemplary embodiments of the invention as contemplated by the inventor(s), and thus, are not intended to limit the invention or the appended claims in any way.

While the invention has been described herein with reference to exemplary embodiments for exemplary fields and applications, it should be understood that the invention is not limited thereto. Other embodiments and modifications thereto are possible, and are within the scope and spirit of the invention. For example, and without limiting the generality of this paragraph, embodiments are not limited to the software, hardware, firmware, and/or entities illustrated in the figures and/or described herein. Further, embodiments (whether or not explicitly described herein) have significant utility to fields and applications beyond the examples described herein.

Embodiments have been described herein with the aid of functional building blocks illustrating the implementation of specified functions and relationships thereof. The boundaries of these functional building blocks have been arbitrarily defined herein for the convenience of the description. Alternate boundaries can be defined as long as the specified functions and relationships (or equivalents thereof) are appropriately performed. Also, alternative embodiments may perform functional blocks, steps, operations, methods, etc. using orderings different than those described herein.

References herein to “one embodiment,” “an embodiment,” “an example embodiment,” or similar phrases, indicate that the embodiment described may include a particular feature, structure, or characteristic, but every embodiment may not necessarily include the particular feature, structure, or characteristic. Moreover, such phrases are not necessarily referring to the same embodiment. Further, when a particular feature, structure, or characteristic is described in connection with an embodiment, it would be within the knowledge of persons skilled in the relevant art(s) to incorporate such feature, structure, or characteristic into other embodiments whether or not explicitly mentioned or described herein.

The breadth and scope of the invention should not be limited by any of the above-described exemplary embodiments, but should be defined only in accordance with the following claims and their equivalents.

What is claimed is:
1. A computer implemented method, comprising: dividing a training set of personal data records into a first subset and a second subset, wherein each personal data record of the training set includes an action category corresponding to an action taken by an individual corresponding to the personal data record; training a first prediction function with the first subset; in response to training the first prediction function, applying the first prediction function to the second subset to generate a first predictor score indicating accuracy of the first prediction function based on action values for the action category of each of the personal data records of the second subset; training a second prediction function with the second subset; in response to training the second prediction function, applying the second prediction function to the first subset to generate a second predictor score indicating accuracy of the second prediction function based on action values for the action category of each of the personal data records of the first subset; determining that the first predictor score indicates a higher accuracy of action values relative to the second predictor score; in response to the determining, applying the first prediction function to a personal data record from a data set different from the training set to determine an outcome score indicating a probability of an action value corresponding to the personal data record from the data set.
2. The computer implemented method of claim 1, wherein applying the first prediction function to the second subset further comprises: determining a mean squared error of outcome values of the first prediction function relative to the action values for the action category of each of the personal data records of the second subset.
3. The computer implemented method of claim 1, further comprising: parsing data within the personal data records of the training set; and assigning labels corresponding to predetermined categories, wherein the labels include the action category.
4. The computer implemented method of claim 1, further comprising: comparing a first personal data record of the training set with a second personal data record stored in a database; and determining that the first personal data record and the second personal data record correspond to a common individual by determining that a similarity score based on a pair-wise function exceeds a predetermined threshold.
5. The computer implemented method of claim 4, further comprising: in response to the determining, adding a category value from the second personal data record to the first personal data record in the training set; and in response to the adding, dividing the training set into the first subset and the second subset.
6. The computer implemented method of claim 1, wherein the action values indicate a number of occurrences of an action.
7. The computer implemented method of claim 1, wherein the outcome score is a decimal value between 0.0 and 1.0.
8. A system, comprising: a memory; and at least one processor coupled to the memory and configured to: divide a training set of personal data records into a first subset and a second subset, wherein each personal data record of the training set includes an action category corresponding to an action taken by an individual corresponding to the personal data record; train a first prediction function with the first subset; in response to training the first prediction function, apply the first prediction function to the second subset to generate a first predictor score indicating accuracy of the first prediction function based on action values for the action category of each of the personal data records of the second subset; train a second prediction function with the second subset; in response to training the second prediction function, apply the second prediction function to the first subset to generate a second predictor score indicating accuracy of the second prediction function based on action values for the action category of each of the personal data records of the first subset; determine that the first predictor score indicates a higher accuracy of action values relative to the second predictor score; in response to the determining, apply the first prediction function to a personal data record from a data set different from the training set to determine an outcome score indicating a probability of an action value corresponding to the personal data record from the data set.
9. The system of claim 8, wherein to apply the first prediction function to the second subset, the at least one processor is further configured to: determine a mean squared error of outcome values of the first prediction function relative to the action values for the action category of each of the personal data records of the second subset.
10. The system of claim 8, wherein the at least one processor is further configured to: parse data within the personal data records of the training set; and assign labels corresponding to predetermined categories, wherein the labels include the action category.
11. The system of claim 8, wherein the at least one processor is further configured to: compare a first personal data record of the training set with a second personal data record stored in a database; and determine that the first personal data record and the second personal data record correspond to a common individual by determining that a similarity score based on a pair-wise function exceeds a predetermined threshold.
12. The system of claim 11, wherein the at least one processor is further configured to: in response to the determining, add a category value from the second personal data record to the first personal data record in the training set; and in response to the adding, divide the training set into the first subset and the second subset.
13. The system of claim 8, wherein the action values indicate a number of occurrences of an action.
14. The system of claim 8, wherein the outcome score is a decimal value between 0.0 and 1.0.
15. A non-transitory computer-readable device having instructions stored thereon that, when executed by at least one computing device, cause the at least one computing device to perform operations comprising: dividing a training set of personal data records into a first subset and a second subset, wherein each personal data record of the training set includes an action category corresponding to an action taken by an individual corresponding to the personal data record; training a first prediction function with the first subset; in response to training the first prediction function, applying the first prediction function to the second subset to generate a first predictor score indicating accuracy of the first prediction function based on action values for the action category of each of the personal data records of the second subset; training a second prediction function with the second subset; in response to training the second prediction function, applying the second prediction function to the first subset to generate a second predictor score indicating accuracy of the second prediction function based on action values for the action category of each of the personal data records of the first subset; determining that the first predictor score indicates a higher accuracy of action values relative to the second predictor score; in response to the determining, applying the first prediction function to a personal data record from a data set different from the training set to determine an outcome score indicating a probability of an action value corresponding to the personal data record from the data set.
16. The non-transitory computer-readable device of claim 15, wherein applying the first prediction function to the second subset further comprises: determining a mean squared error of outcome values of the first prediction function relative to the action values for the action category of each of the personal data records of the second subset.
17. The non-transitory computer-readable device of claim 15, the operations further comprising: parsing data within the personal data records of the training set; and assigning labels corresponding to predetermined categories, wherein the labels include the action category.
18. The non-transitory computer-readable device of claim 15, the operations further comprising: comparing a first personal data record of the training set with a second personal data record stored in a database; and determining that the first personal data record and the second personal data record correspond to a common individual by determining that a similarity score based on a pair-wise function exceeds a predetermined threshold.
19. The non-transitory computer-readable device of claim 18, the operations further comprising: in response to the determining, adding a category value from the second personal data record to the first personal data record in the training set; and in response to the adding, dividing the training set into the first subset and the second subset.
20. The non-transitory computer-readable device of claim 15, wherein the action values indicate a number of occurrences of an action.